Group Members:
This project is an example of using customer segmentation to define a marketing strategy. In this sample data set which summarizes the usage behavior of approximately 9000 active credit card holders in the last 6 months, our aim is to perform this segmentation in the most accurate way.
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 C10001 40.90075 0.818182 95.40 0.00
## 2 C10002 3202.46742 0.909091 0.00 0.00
## 3 C10003 2495.14886 1.000000 773.17 773.17
## 4 C10004 1666.67054 0.636364 1499.00 1499.00
## 5 C10005 817.71434 1.000000 16.00 16.00
## 6 C10006 1809.82875 1.000000 1333.28 0.00
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 95.40 0.000 0.166667
## 2 0.00 6442.945 0.000000
## 3 0.00 0.000 1.000000
## 4 0.00 205.788 0.083333
## 5 0.00 0.000 0.083333
## 6 1333.28 0.000 0.666667
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 0.000000 0.083333
## 2 0.000000 0.000000
## 3 1.000000 0.000000
## 4 0.083333 0.000000
## 5 0.083333 0.000000
## 6 0.000000 0.583333
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 1 0.000000 0 2 1000
## 2 0.250000 4 0 7000
## 3 0.000000 0 12 7500
## 4 0.083333 1 1 7500
## 5 0.000000 0 1 1200
## 6 0.000000 0 8 1800
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 201.8021 139.5098 0.000000 12
## 2 4103.0326 1072.3402 0.222222 12
## 3 622.0667 627.2848 0.000000 12
## 4 0.0000 NA 0.000000 12
## 5 678.3348 244.7912 0.000000 12
## 6 1400.0578 2407.2460 0.000000 12
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
## 'data.frame': 8950 obs. of 18 variables:
## $ CUST_ID : chr "C10001" "C10002" "C10003" "C10004" ...
## $ BALANCE : num 40.9 3202.5 2495.1 1666.7 817.7 ...
## $ BALANCE_FREQUENCY : num 0.818 0.909 1 0.636 1 ...
## $ PURCHASES : num 95.4 0 773.2 1499 16 ...
## $ ONEOFF_PURCHASES : num 0 0 773 1499 16 ...
## $ INSTALLMENTS_PURCHASES : num 95.4 0 0 0 0 ...
## $ CASH_ADVANCE : num 0 6443 0 206 0 ...
## $ PURCHASES_FREQUENCY : num 0.1667 0 1 0.0833 0.0833 ...
## $ ONEOFF_PURCHASES_FREQUENCY : num 0 0 1 0.0833 0.0833 ...
## $ PURCHASES_INSTALLMENTS_FREQUENCY: num 0.0833 0 0 0 0 ...
## $ CASH_ADVANCE_FREQUENCY : num 0 0.25 0 0.0833 0 ...
## $ CASH_ADVANCE_TRX : int 0 4 0 1 0 0 0 0 0 0 ...
## $ PURCHASES_TRX : int 2 0 12 1 1 8 64 12 5 3 ...
## $ CREDIT_LIMIT : num 1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
## $ PAYMENTS : num 202 4103 622 0 678 ...
## $ MINIMUM_PAYMENTS : num 140 1072 627 NA 245 ...
## $ PRC_FULL_PAYMENT : num 0 0.222 0 0 0 ...
## $ TENURE : int 12 12 12 12 12 12 12 12 12 12 ...
Hmm, all variables are numeric other than CUST_ID which will be removed for clustering.
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
##
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 38.0 Median : 89.0 Median : 0.0
## Mean : 592.4 Mean : 411.1 Mean : 978.9
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8
## Max. :40761.2 Max. :22500.0 Max. :47137.2
##
## PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.08333 1st Qu.:0.00000
## Median :0.50000 Median :0.08333
## Mean :0.49035 Mean :0.20246
## 3rd Qu.:0.91667 3rd Qu.:0.30000
## Max. :1.00000 Max. :1.00000
##
## PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.1667 Median :0.0000 Median : 0.000
## Mean :0.3644 Mean :0.1351 Mean : 3.249
## 3rd Qu.:0.7500 3rd Qu.:0.2222 3rd Qu.: 4.000
## Max. :1.0000 Max. :1.5000 Max. :123.000
##
## PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS
## Min. : 0.00 Min. : 50 Min. : 0.0 Min. : 0.02
## 1st Qu.: 1.00 1st Qu.: 1600 1st Qu.: 383.3 1st Qu.: 169.12
## Median : 7.00 Median : 3000 Median : 856.9 Median : 312.34
## Mean : 14.71 Mean : 4494 Mean : 1733.1 Mean : 864.21
## 3rd Qu.: 17.00 3rd Qu.: 6500 3rd Qu.: 1901.1 3rd Qu.: 825.49
## Max. :358.00 Max. :30000 Max. :50721.5 Max. :76406.21
## NA's :1 NA's :313
## PRC_FULL_PAYMENT TENURE
## Min. :0.0000 Min. : 6.00
## 1st Qu.:0.0000 1st Qu.:12.00
## Median :0.0000 Median :12.00
## Mean :0.1537 Mean :11.52
## 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :1.0000 Max. :12.00
##
## [1] 314
Where is my missing NA values?
I know where is my missing values, so I would like to replace that values with 0.
## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
##
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 38.0 Median : 89.0 Median : 0.0
## Mean : 592.4 Mean : 411.1 Mean : 978.9
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8
## Max. :40761.2 Max. :22500.0 Max. :47137.2
##
## PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.08333 1st Qu.:0.00000
## Median :0.50000 Median :0.08333
## Mean :0.49035 Mean :0.20246
## 3rd Qu.:0.91667 3rd Qu.:0.30000
## Max. :1.00000 Max. :1.00000
##
## PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.1667 Median :0.0000 Median : 0.000
## Mean :0.3644 Mean :0.1351 Mean : 3.249
## 3rd Qu.:0.7500 3rd Qu.:0.2222 3rd Qu.: 4.000
## Max. :1.0000 Max. :1.5000 Max. :123.000
##
## PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS
## Min. : 0.00 Min. : 50 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1.00 1st Qu.: 1600 1st Qu.: 383.3 1st Qu.: 163.0
## Median : 7.00 Median : 3000 Median : 856.9 Median : 289.6
## Mean : 14.71 Mean : 4494 Mean : 1733.1 Mean : 834.0
## 3rd Qu.: 17.00 3rd Qu.: 6500 3rd Qu.: 1901.1 3rd Qu.: 788.7
## Max. :358.00 Max. :30000 Max. :50721.5 Max. :76406.2
## NA's :1
## PRC_FULL_PAYMENT TENURE
## Min. :0.0000 Min. : 6.00
## 1st Qu.:0.0000 1st Qu.:12.00
## Median :0.0000 Median :12.00
## Mean :0.1537 Mean :11.52
## 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :1.0000 Max. :12.00
##
As we see we also see 1 NA value on CREDIT LIMIT. As we know from banking system, CREDIT LIMIT can not be 0, I would like to give a mean value to this NA.
cc_data$CREDIT_LIMIT[which(is.na(cc_data$CREDIT_LIMIT))] <- mean(cc_data$CREDIT_LIMIT,
na.rm=TRUE)
summary(cc_data)## CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES
## Length:8950 Min. : 0.0 Min. :0.0000 Min. : 0.00
## Class :character 1st Qu.: 128.3 1st Qu.:0.8889 1st Qu.: 39.63
## Mode :character Median : 873.4 Median :1.0000 Median : 361.28
## Mean : 1564.5 Mean :0.8773 Mean : 1003.20
## 3rd Qu.: 2054.1 3rd Qu.:1.0000 3rd Qu.: 1110.13
## Max. :19043.1 Max. :1.0000 Max. :49039.57
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE
## Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 38.0 Median : 89.0 Median : 0.0
## Mean : 592.4 Mean : 411.1 Mean : 978.9
## 3rd Qu.: 577.4 3rd Qu.: 468.6 3rd Qu.: 1113.8
## Max. :40761.2 Max. :22500.0 Max. :47137.2
## PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.08333 1st Qu.:0.00000
## Median :0.50000 Median :0.08333
## Mean :0.49035 Mean :0.20246
## 3rd Qu.:0.91667 3rd Qu.:0.30000
## Max. :1.00000 Max. :1.00000
## PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.1667 Median :0.0000 Median : 0.000
## Mean :0.3644 Mean :0.1351 Mean : 3.249
## 3rd Qu.:0.7500 3rd Qu.:0.2222 3rd Qu.: 4.000
## Max. :1.0000 Max. :1.5000 Max. :123.000
## PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS
## Min. : 0.00 Min. : 50 Min. : 0.0 Min. : 0.0
## 1st Qu.: 1.00 1st Qu.: 1600 1st Qu.: 383.3 1st Qu.: 163.0
## Median : 7.00 Median : 3000 Median : 856.9 Median : 289.6
## Mean : 14.71 Mean : 4494 Mean : 1733.1 Mean : 834.0
## 3rd Qu.: 17.00 3rd Qu.: 6500 3rd Qu.: 1901.1 3rd Qu.: 788.7
## Max. :358.00 Max. :30000 Max. :50721.5 Max. :76406.2
## PRC_FULL_PAYMENT TENURE
## Min. :0.0000 Min. : 6.00
## 1st Qu.:0.0000 1st Qu.:12.00
## Median :0.0000 Median :12.00
## Mean :0.1537 Mean :11.52
## 3rd Qu.:0.1429 3rd Qu.:12.00
## Max. :1.0000 Max. :12.00
Re-checking my values again for NA detection.
There is no missing values on my dataset. Yey!
cc_data <- cc_data[, -1] #Removing my first column, will let me to play with this dataset correctly.## 'data.frame': 8950 obs. of 17 variables:
## $ BALANCE : num 40.9 3202.5 2495.1 1666.7 817.7 ...
## $ BALANCE_FREQUENCY : num 0.818 0.909 1 0.636 1 ...
## $ PURCHASES : num 95.4 0 773.2 1499 16 ...
## $ ONEOFF_PURCHASES : num 0 0 773 1499 16 ...
## $ INSTALLMENTS_PURCHASES : num 95.4 0 0 0 0 ...
## $ CASH_ADVANCE : num 0 6443 0 206 0 ...
## $ PURCHASES_FREQUENCY : num 0.1667 0 1 0.0833 0.0833 ...
## $ ONEOFF_PURCHASES_FREQUENCY : num 0 0 1 0.0833 0.0833 ...
## $ PURCHASES_INSTALLMENTS_FREQUENCY: num 0.0833 0 0 0 0 ...
## $ CASH_ADVANCE_FREQUENCY : num 0 0.25 0 0.0833 0 ...
## $ CASH_ADVANCE_TRX : int 0 4 0 1 0 0 0 0 0 0 ...
## $ PURCHASES_TRX : int 2 0 12 1 1 8 64 12 5 3 ...
## $ CREDIT_LIMIT : num 1000 7000 7500 7500 1200 1800 13500 2300 7000 11000 ...
## $ PAYMENTS : num 202 4103 622 0 678 ...
## $ MINIMUM_PAYMENTS : num 140 1072 627 0 245 ...
## $ PRC_FULL_PAYMENT : num 0 0.222 0 0 0 ...
## $ TENURE : int 12 12 12 12 12 12 12 12 12 12 ...
# Density Plot of frequency attributes
ggplot(cc_data ,aes(x=BALANCE_FREQUENCY))+
geom_density(alpha=0.5, fill = "darkgreen") +
labs(title="Balance Frequency") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme_minimal()ggplot(cc_data ,aes(x=ONEOFF_PURCHASES_FREQUENCY))+
geom_density(alpha=0.5, fill = "darkgreen") +
labs(title="Oneoff Purchases Frequency") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme_minimal()ggplot(cc_data ,aes(x=PURCHASES_FREQUENCY))+
geom_density(alpha=0.5, fill = "darkgreen") +
labs(title="Purchases Frequency") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme_minimal()ggplot(cc_data ,aes(x=PURCHASES_INSTALLMENTS_FREQUENCY))+
geom_density(alpha=0.5, fill = "darkgreen") +
labs(title="Purchases Installments Frequency") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme_minimal()ggplot(cc_data ,aes(x=CASH_ADVANCE_FREQUENCY))+
geom_density(alpha=0.5, fill = "darkgreen") +
labs(title="Cash Advance Frequency") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
theme_minimal()# Historgram for each non-frequency attribute
freq <- c("BALANCE_FREQUENCY", "ONEOFF_PURCHASES_FREQUENCY", "PURCHASES_FREQUENCY",
"PURCHASES_INSTALLMENTS_FREQUENCY", "CASH_ADVANCE_FREQUENCY")
cc_data %>%
select(-freq) %>%
gather(Attributes, value) %>%
ggplot(aes(x=value)) +
geom_histogram(fill = "lightblue2", color = "black", stat = "count") +
facet_wrap(~Attributes, scales = "free_x") +
labs(x = "Value", y = "Frequency",
title="Credit Cards Attributes - Histograms") +
theme(axis.text.x = element_text(colour = "grey0", size = 10, angle = 30, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 10),
text = element_text(size = 12)) +
theme_minimal()## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(freq)` instead of `freq` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
corrplot(cor(cc_data), diag = FALSE, type = "upper", order = "hclust",
tl.col = "black", tl.pos = "td", tl.cex = 0.5, method = "circle")# Relationship between Balance and Cash Advance
ggplot(cc_data, aes(x=BALANCE, y=CASH_ADVANCE)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(title="Credit Cards Attributes",
subtitle="Relationship between Balance and Cash Advance") +
theme_bw()# Normalization
cc_scaled <- as.data.frame(scale(cc_data)) #It helps us to normalise the data within a particular range. (Also scaling gives us fast processing.)
# Original data
p1 <- ggplot(cc_data, aes(x=CREDIT_LIMIT, y=BALANCE)) +
geom_point() +
labs(title="Original data") +
theme_bw()
# Normalized data
p2 <- ggplot(cc_scaled, aes(x=CREDIT_LIMIT, y=BALANCE)) +
geom_point() +
labs(title="Normalized data") +
theme_bw()
# Subplot
grid.arrange(p1, p2, ncol=2) PCA Analysis
res.pca <- prcomp(cc_scaled, scale = TRUE)
# Visualize eigenvalues/variances
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))As shown graph above, 10 components cover nearly 80% of variance according to PCA analysis.
# Extract the results for variables
var <- get_pca_var(res.pca)
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)# Control variable colors using their contributions to the principle axis
fviz_pca_var(res.pca, col.var="contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
) + theme_minimal() + ggtitle("Variables - PCA")Kmeans is a simple unsupervised learning algorithm to cluster dataset into groups. Even if the cluster number is not optimal for the dataset, the algorithm splits into the given number of groups. It can be got different results even if the same code runs, because of changing the starting point of the model. Therefore randomization is the essential point of the Kmeans algorithm as the same as other algorithms.
We can start four centers for our model, and let’s look at the results:
set.seed(96743) # because starting assignments are random
k <- kmeans(cc_scaled, centers=4, nstart = 25)## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 -0.3186789 0.2474748 0.1129728 0.004276753
## 2 -0.2659988 -0.3707411 -0.3427854 -0.230659248
## 3 1.4592898 0.3849408 -0.2350014 -0.164145135
## 4 0.9626347 0.4634767 3.1639402 2.748625306
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 0.2590371 -0.3649444 0.9868072
## 2 -0.3865510 -0.1829484 -0.7939753
## 3 -0.2541801 1.6872281 -0.5064753
## 4 2.4315805 -0.1559175 1.1341286
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 0.3233029 0.8767322
## 2 -0.3897275 -0.7102531
## 3 -0.2137049 -0.4515378
## 4 1.8033921 1.0604485
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 1 -0.4607102 -0.3591435 0.3036958 -0.07308479
## 2 -0.1035736 -0.1658335 -0.4740003 -0.33458515
## 3 1.7485690 1.6192911 -0.2840458 0.83870197
## 4 -0.3254707 -0.1720256 3.0161122 1.43770090
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 -0.1357102 -0.08443574 0.3948791 0.06049096
## 2 -0.2623140 -0.12819077 -0.2582916 -0.05424328
## 3 0.6032523 0.49933830 -0.4062667 -0.09752305
## 4 1.9468103 0.49277411 0.4704433 0.32339259
clusplot(cc_data, k$cluster, color=TRUE, shade=TRUE,
labels=4, lines=0, main="K-means cluster plot")The figure shown above plotted by two components which cover 47.62% of the point variability.
## [1] 3363 3989 1197 401
There is a problem that must be solved because the distribution of the size between clusters seems unbalanced. There may be several reasons for this, we could start with control the number of clusters whether the cluster number is true.
There are different methods to find the optimal cluster number in the Kmeans algorithm. We focus on them in the next section to find the best cluster number.
There are several popular methods to find optimal clusters. We applied two methods listed below:
They are explained with their examples respectively below.
In the Elbow method, it is plotted a line chart using within-groups the sum of squares of the model which is applied the Kmeans algorithm with the range of different k values. Because a line plot seems like an arm, this method is named Elbow Method. SSE is decreasing as increasing k value because the distance between clusters decreases. Our goal is to choose k value corresponding to the “Elbow point” in the chart. This point is also the minimum SSE value for our model.
wss <- (nrow(cc_data)-1)*sum(apply(cc_data,2,var))
for (i in 2:15) {
wss[i] <- sum(kmeans(cc_data, centers=i)$tot.withinss)
}
plot(1:15, wss, type="b", pch = 19, frame = FALSE,
xlab="Number of Clusters",
ylab="Within groups sum of squares")fviz_nbclust(cc_scaled, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2)+
labs(subtitle = "Elbow method")According to the Elbow Method, the elbow point of the model seems seven cluster is the best choice.
In the Silhouette method, it is plotted a line chart like an Elbow Method using the range of different k values. While the Elbow Method tries to calculate the error between the clusters, the Silhouette Method tries to estimate the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. therefore, our goal is to decide k value corresponding to a higher average distance between clusters.
# function to compute average silhouette for k clusters
avg_sil <- function(k) {
km.res <- kmeans(cc_scaled, centers = k, nstart = 25, iter.max = 50)
ss <- silhouette(km.res$cluster, dist(cc_scaled))
mean(ss[, 3])
}
# Compute and plot wss for k = 2 to k = 15
k.values <- 2:15
# extract avg silhouette for 2-15 clusters
avg_sil_values <- map_dbl(k.values, avg_sil)## Warning: Quick-TRANSfer stage steps exceeded maximum (= 447500)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 447500)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 447500)
plot(k.values, avg_sil_values,
type = "b", pch = 19, frame = FALSE,
xlab = "Number of clusters K",
ylab = "Average Silhouettes")With a different visualization technique for Silhouette Method:
The seven clusters seems the optimal number for our model.
It seems seven clusters are good for our model.
# Execution of k-means with k=7
set.seed(1234)
seg.k7 <- kmeans(cc_scaled, centers=7)
# Mean values of each cluster
aggregate(cc_data, by=list(seg.k7$cluster), mean)## Group.1 BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 1 112.5092 0.3576877 306.9341 186.1401
## 2 2 1539.7193 0.9766219 1495.1614 781.7478
## 3 3 2976.6239 0.9503170 222.1891 164.9473
## 4 4 1247.6485 0.9661524 290.8402 230.5123
## 5 5 137.1797 0.9108183 1346.9436 656.8064
## 6 6 4151.3604 0.9844211 10046.3581 6738.1296
## 7 7 6103.8708 0.9734736 865.3894 520.8899
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 121.03585 332.59923 0.2659229
## 2 713.63388 352.71995 0.9041611
## 3 57.27007 2697.05845 0.1596098
## 4 60.75994 449.08079 0.1843836
## 5 690.24556 48.36178 0.8648975
## 6 3310.79265 667.50401 0.9414595
## 7 344.66398 7172.18706 0.4116654
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 0.06771059 0.18979894
## 2 0.34916135 0.73319274
## 3 0.08581231 0.07719388
## 4 0.10001614 0.08261485
## 5 0.29078703 0.67504441
## 6 0.75859883 0.78802772
## 7 0.19474781 0.30083949
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
## 1 0.038265000 0.7522412 4.022005 3616.470
## 2 0.064245784 1.2518519 27.009150 4501.043
## 3 0.418021648 9.6332518 3.016300 5154.418
## 4 0.099665131 1.7315191 3.277823 3016.450
## 5 0.008449436 0.1438980 20.985428 4743.488
## 6 0.068019923 1.9615385 104.559829 10718.590
## 7 0.569196821 21.0835381 13.395577 9804.791
## PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 1058.4390 125.3299 0.18090335 11.28851
## 2 1584.6602 984.7496 0.05019256 11.74205
## 3 1769.4853 1177.4802 0.03052005 10.95355
## 4 846.0003 694.3229 0.02345691 11.70024
## 5 1443.8112 179.4812 0.77294468 11.38525
## 6 9475.3428 2379.7085 0.31674990 11.95299
## 7 6190.6427 2806.5320 0.05055040 11.63882
## [1] 1227 2295 1227 2462 1098 234 407
# Clustering
ggpairs(cbind(cc_data, Cluster=as.factor(seg.k7$cluster)),
columns=1:6, aes(colour=Cluster, alpha=0.5),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both") +
theme_bw()## [1] 1227 2295 1227 2462 1098 234 407
The aim was to segment customers to define a marketing strategy. Below you can see the user groups that we have segmented according to the clusters we have reserved.
Cluster 1 - Subclass user group: They are the users with the lowest cash. Credit limits are also low. They don’t buy often. They do not perform banking transactions too much and their balances are not updated frequently.
Cluster 2 - Balanced middle-class user group: Users who are in the 2nd largest group in this regard when they are considered as one-time purchases, with no significant differences between cash and spending expenses. Installment amounts are above average and it is the second largest group in this regard. The number of purchases is quite high. They are in Group 2 in this field as well. They have an average credit limit.
Cluster 3 - User group with the smallest expenditures: They have the third highest balance. They work in advance. They have an average credit limit. Credit card service periods are the lowest.
Cluster 4 - Small Spending and User Group with the Lowest Credit Limit: These users are included in the group with the lowest credit limit, but they do not buy too much. They are the largest customer group.
Cluster 5 - Credit card lovers group: It is the lowest cash-buying group. They make an average of 80% of their repayments. This group is the third smallest in number and they do not keep their money in the bank.
Cluster 6 - User group with big expenditures: It is the second user group with the highest cash. They make expensive purchases and have the highest credit limit. It is the smallest customer group. 7. Users in the group are candidates for entering this group. Loyalty bank applications can be implemented to increase spending.
Cluster 7 - Frugal user group w/ money: It is the group with the highest cash. They generally like to trade with cash advance. They pay attention to their balances and expenses at the bank. It is the group that makes the highest payment in terms of reimbursement. Marketing transactions can be carried out in order to ensure that the transactions made in cash are returned to the credit card.